
Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

Author: C.H. Bischof
G. Quintana-Orti
G.H. Golub
M. Gu
R. Schreiber
S. Chandrasekaran
Z. Drmač
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2013

The QR decomposition with column pivoting (QRP) of a matrix is widely used for rank revealing. The performance of the LAPACK implementation (DGEQP3) of the Householder QRP algorithm is limited by the Level 2 BLAS operations required for updating the column norms. In this paper, we propose an implementation of the QRP algorithm that distributes the matrix columns in a round-robin fashion for better data locality and parallel memory bus utilization on multicore architectures. Our performance results show a 60% improvement over the routine DGEQP3 of Intel MKL (version 10.3) on a 12-core Intel Xeon X5670 machine. In addition, we show that the same data distribution is also suitable for general-purpose GPU processors, where our implementation obtains up to 90 GFlops on an NVIDIA GeForce GTX480. This is about 2 times faster than the QRP implementation of MAGMA (version 1.2.1).

Tomás and Bai were supported in part by the U.S. DOE SciDAC grant DOE-DE-FC0206ER25793 and NSF grant PHY1005502. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. DOE under Contract No. DE-AC02-05CH11231.

Tomás Domínguez, A.E.; Bai, Z.; Hernández García, V. (2013). Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors. In: High Performance Computing for Computational Science - VECPAR 2012, pp. 50-58. Springer Verlag (Germany). https://doi.org/10.1007/978-3-642-38718-0_8
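The kernel this abstract refers to can be sketched as an unblocked Householder QRP in plain Python. This is a minimal, sequential illustration, not the authors' multicore or GPU code: `qrp` and `cyclic_columns` are hypothetical names, Q is not formed, and the partial column norms are downdated after every step with a recomputation safeguard similar to the one in LAPACK's `xLAQP2`. That per-column norm downdate is the memory-bound work the abstract identifies as the bottleneck; under a column-cyclic layout, thread `t` of `p` would own columns `t, t+p, t+2p, ...` and update only those.

```python
import math
import sys

# Threshold for recomputing a downdated norm (same role as LAPACK's TOL3Z = sqrt(eps)).
TOL3Z = math.sqrt(sys.float_info.epsilon)


def qrp(A):
    """Unblocked Householder QR with column pivoting.

    A: list of m rows of n floats (copied, not modified).
    Returns (R, perm) with A[:, perm] = Q R; R is m x n upper trapezoidal, Q is not formed.
    """
    A = [row[:] for row in A]
    m, n = len(A), len(A[0])
    perm = list(range(n))

    def colnorm(j, start):
        return math.sqrt(sum(A[i][j] ** 2 for i in range(start, m)))

    vn1 = [colnorm(j, 0) for j in range(n)]  # partial norms, downdated every step
    vn2 = vn1[:]                             # norms at the last full recomputation
    for k in range(min(m, n)):
        # Pivot: move the remaining column with the largest partial norm to position k.
        p = max(range(k, n), key=lambda j: vn1[j])
        if p != k:
            for row in A:
                row[k], row[p] = row[p], row[k]
            for lst in (perm, vn1, vn2):
                lst[k], lst[p] = lst[p], lst[k]
        # Householder vector v with (I - 2 v v^T / v^T v) A[k:, k] = alpha * e_1.
        x = [A[i][k] for i in range(k, m)]
        alpha = -math.copysign(math.sqrt(sum(t * t for t in x)), x[0])
        if alpha == 0.0:
            continue  # the pivot column is zero, so every remaining column is too
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        for j in range(k, n):  # apply the reflector to the trailing columns
            s = 2.0 * sum(v[i - k] * A[i][j] for i in range(k, m)) / vtv
            for i in range(k, m):
                A[i][j] -= s * v[i - k]
        for i in range(k + 1, m):
            A[i][k] = 0.0  # store exact zeros below the diagonal
        # Downdate: ||A[k+1:, j]||^2 = ||A[k:, j]||^2 - A[k][j]^2.
        for j in range(k + 1, n):
            if vn1[j] == 0.0:
                continue
            temp = max(0.0, 1.0 - (abs(A[k][j]) / vn1[j]) ** 2)
            if temp * (vn1[j] / vn2[j]) ** 2 <= TOL3Z:
                vn1[j] = vn2[j] = colnorm(j, k + 1)  # cancellation risk: recompute
            else:
                vn1[j] *= math.sqrt(temp)
    return A, perm


def cyclic_columns(t, n, nthreads):
    """Columns owned by thread t under a column-cyclic (round-robin) distribution."""
    return list(range(t, n, nthreads))
```

For a 4 x 3 matrix whose third column is the sum of the first two, `|R[2][2]|` comes out at roundoff level, which is the rank-revealing behaviour QRP is used for.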

Sources: CiteSeerX, Crossref, RiuNet

Interpolatory methods for $\mathcal{H}_\infty$ model reduction of multi-input/multi-output systems

Author: A.C. Antoulas
B. Gustavsen
B.D.O. Anderson
C.A. Beattie
D. Deschrijver
D. Kavranoglu
G.H. Golub
G.M. Flagg
J. Nocedal
K. Glover
K.A. Gallivan
L.N. Trefethen
P. Benner
S. Gugercin
S. Lefteriu
S.J. Wright
T. Mitchell
Z. Drmač
Z. Ugray
Publication venue: Springer Science and Business Media LLC
Publication date: 04/10/2016

We develop here a computationally effective approach for producing high-quality $\mathcal{H}_\infty$-approximations to large-scale linear dynamical systems having multiple inputs and multiple outputs (MIMO). We extend an approach for $\mathcal{H}_\infty$ model reduction introduced by Flagg, Beattie, and Gugercin for the single-input/single-output (SISO) setting, which combined ideas originating in interpolatory $\mathcal{H}_2$-optimal model reduction with complex Chebyshev approximation. Retaining this framework, our approach to the MIMO problem has its principal computational cost dominated by (sparse) linear solves, so it can remain an effective strategy in many large-scale settings. We avoid the computationally demanding $\mathcal{H}_\infty$ norm calculations normally required to monitor progress within each optimization cycle by using "data-driven" rational approximations built upon previously computed function samples. Numerical examples are included that illustrate our approach. We produce high-fidelity reduced models having consistently better $\mathcal{H}_\infty$ performance than models produced via balanced truncation; these models are often as good as (and occasionally better than) models produced using optimal Hankel norm approximation. In all cases considered, the method described here produces reduced models at far lower cost than is possible with either balanced truncation or optimal Hankel norm approximation.
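The interpolatory building block that keeps the cost dominated by linear solves can be sketched as bitangential Petrov-Galerkin projection. This is a minimal dense sketch, assuming E = I, real interpolation points that are not eigenvalues of A, and a nonsingular reduced pencil; the helper names (`tangential_rom`, `response`, `solve`, and the small matrix utilities) are hypothetical. It is not the authors' $\mathcal{H}_\infty$ method: the complex Chebyshev fitting and the data-driven rational approximations are not shown. Each interpolation point costs one solve with (σI - A) and one with its transpose; for large sparse A these would be sparse solves.

```python
def solve(M, rhs):
    """Dense Gaussian elimination with partial pivoting; M is assumed nonsingular."""
    n = len(M)
    a = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[piv] = a[piv], a[k]
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n + 1):
                a[i][j] -= f * a[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))) / a[i][i]
    return x


def transpose(M):
    return [list(col) for col in zip(*M)]


def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]


def matmul(X, Y):
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def shifted(E, A, s):
    """Return s*E - A."""
    return [[s * e - a for e, a in zip(erow, arow)] for erow, arow in zip(E, A)]


def tangential_rom(A, B, C, shifts, rdirs, ldirs):
    """Reduced model (Er, Ar, Br, Cr) interpolating H(s) = C (sI - A)^{-1} B tangentially."""
    n = len(A)
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    # Columns of V: (s_i I - A)^{-1} B b_i.
    V = transpose([solve(shifted(I, A, s), matvec(B, b)) for s, b in zip(shifts, rdirs)])
    # Rows of W^T: (s_i I - A)^{-T} C^T c_i.
    Ct = transpose(C)
    Wt = [solve(transpose(shifted(I, A, s)), matvec(Ct, c)) for s, c in zip(shifts, ldirs)]
    Er = matmul(Wt, V)
    Ar = matmul(matmul(Wt, A), V)
    Br = matmul(Wt, B)
    Cr = matmul(C, V)
    return Er, Ar, Br, Cr


def response(E, A, B, C, s, b):
    """Evaluate H(s) b = C (sE - A)^{-1} B b at a real point s."""
    return matvec(C, solve(shifted(E, A, s), matvec(B, b)))
```

With this construction, Hr(s) = Cr (s Er - Ar)^{-1} Br matches H(σ_i) b_i, c_i^T H(σ_i), and c_i^T H'(σ_i) b_i at every interpolation point σ_i; the usage check below only compares the right tangential values.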

Sources: arXiv.org e-Print Archive, Crossref

A GPU-based hyperbolic SVD algorithm

Author: A.H. Sameh
F.T. Luk
G.S. Sachdev
H. Zha
I. Slapničar
J.R. Bunch
K. Veselić
R. Mathias
R.P. Brent
S. Lahabar
S. Singer
S. Zhang
Sanja Singer
V. Hari
Vedran Novaković
Z. Drmač
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2011

A one-sided Jacobi hyperbolic singular value decomposition (HSVD) algorithm, using a massively parallel graphics processing unit (GPU), is developed. The algorithm also serves as the final stage of solving a symmetric indefinite eigenvalue problem. Numerical testing demonstrates the gains in speed and accuracy over sequential and MPI-parallelized variants of similar Jacobi-type HSVD algorithms. Finally, possibilities of hybrid CPU–GPU parallelism are discussed.

Comment: Accepted for publication in BIT Numerical Mathematics
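A one-sided hyperbolic Jacobi sweep can be sketched sequentially as follows, assuming G has full column rank, G J G^T is nonsingular, and J = diag(±1) is passed as a list of signs. Column pairs with equal signs are orthogonalized by an ordinary rotation and pairs with opposite signs by a hyperbolic rotation, so the accumulated (unformed) V satisfies V^T J V = J. On convergence the columns of G V are mutually orthogonal, their norms are the hyperbolic singular values, and the nonzero eigenvalues of H = G J G^T are J[j] times the squared column norms. The function name, tolerance, and row-cyclic pair ordering are hypothetical choices for illustration; a GPU version would instead process many disjoint column pairs concurrently.

```python
import math


def hyperbolic_jacobi(G, J, tol=1e-13, max_sweeps=50):
    """One-sided hyperbolic Jacobi with a sequential row-cyclic pair ordering.

    G: m x n list of rows (full column rank, G J G^T nonsingular assumed).
    J: list of n signs (+1 or -1).
    Returns (cols, sweeps): cols[j] is column j of G V with V^T J V = J.
    """
    m, n = len(G), len(G[0])
    cols = [[G[i][j] for i in range(m)] for j in range(n)]

    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    for sweep in range(1, max_sweeps + 1):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                a, b, c = dot(cols[p], cols[p]), dot(cols[q], cols[q]), dot(cols[p], cols[q])
                if abs(c) <= tol * math.sqrt(a * b):
                    continue  # columns p and q are already numerically orthogonal
                rotated = True
                gp, gq = cols[p], cols[q]
                if J[p] == J[q]:
                    # Trigonometric rotation: t = tan(angle) solves t^2 + 2*zeta*t - 1 = 0.
                    zeta = (b - a) / (2.0 * c)
                    t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                    cs = 1.0 / math.sqrt(1.0 + t * t)
                    sn = cs * t
                    cols[p] = [cs * x - sn * y for x, y in zip(gp, gq)]
                    cols[q] = [sn * x + cs * y for x, y in zip(gp, gq)]
                else:
                    # Hyperbolic rotation: th = tanh(angle) solves th^2 + 2*zeta*th + 1 = 0.
                    zeta = (a + b) / (2.0 * c)  # |zeta| >= 1 by Cauchy-Schwarz
                    th = -math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(zeta * zeta - 1.0))
                    ch = 1.0 / math.sqrt(1.0 - th * th)
                    sh = ch * th
                    cols[p] = [ch * x + sh * y for x, y in zip(gp, gq)]
                    cols[q] = [sh * x + ch * y for x, y in zip(gp, gq)]
        if not rotated:
            return cols, sweep
    return cols, max_sweeps
```

Because every transformation is J-orthogonal, (G V) J (G V)^T equals G J G^T, which is what lets the HSVD act as the last stage of an indefinite eigenvalue solver.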

Sources: arXiv.org e-Print Archive, CiteSeerX, Crossref, FAMENA Repository

Novel Modifications of Parallel Jacobi Algorithms

Author: A Sluis van der
Aleksandar Ušćumlić
C Ashcraft
FM Dopico
FT Luk
H Zha
I Slapničar
JR Bunch
JW Demmel
K Veselić
NH Rhee
NJ Higham
PPM Rijk de
S Singer
Sanja Singer
Saša Singer
V Hari
Vedran Dunjko
Vedran Novaković
Z Drmač
Publication venue: Springer Science and Business Media LLC
Publication date: 17/05/2011

We describe two main classes of one-sided trigonometric and hyperbolic Jacobi-type algorithms for computing eigenvalues and eigenvectors of Hermitian matrices. These types of algorithms exhibit significant advantages over many other eigenvalue algorithms. If the matrices permit, both types of algorithms compute the eigenvalues and eigenvectors with high relative accuracy. We present novel parallelization techniques for both trigonometric and hyperbolic classes of algorithms, as well as some new ideas on how pivoting in each cycle of the algorithm can improve the speed of the parallel one-sided algorithms. These parallelization approaches are applicable to both distributed-memory and shared-memory machines. The numerical testing performed indicates that the hyperbolic algorithms may be superior to the trigonometric ones, although, in theory, the latter seem more natural.

Comment: Accepted for publication in Numerical Algorithms
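Parallel one-sided Jacobi methods need an ordering in which the n/2 rotations of each step act on disjoint column pairs, so that they can run concurrently. A minimal sketch of one standard choice, the round-robin (circle-method) ordering, is below; `round_robin_steps` is a hypothetical name, the number of columns is assumed even, and this is illustrative only, not the novel parallelization or pivoting strategy of the paper.

```python
def round_robin_steps(n):
    """Round-robin (circle-method) parallel ordering for n columns, n even.

    Returns n-1 steps; each step is a list of n/2 disjoint (p, q) pairs with p < q,
    and every pair of columns occurs in exactly one step, so one sweep consists of
    n-1 steps whose rotations are mutually independent within a step.
    """
    if n % 2:
        raise ValueError("n must be even; pad with a dummy column otherwise")
    others = list(range(1, n))  # column 0 stays fixed, the rest rotate
    steps = []
    for _ in range(n - 1):
        ring = [0] + others
        # Pair the i-th entry of the ring with the i-th entry from the end.
        steps.append([tuple(sorted((ring[i], ring[n - 1 - i]))) for i in range(n // 2)])
        others = others[-1:] + others[:-1]
    return steps
```

For n = 4 this yields the steps [(0, 3), (1, 2)], [(0, 2), (1, 3)], and [(0, 1), (2, 3)], covering all six pairs once.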

Sources: arXiv.org e-Print Archive, Crossref, FAMENA Repository